One of the key activities of any IT function is to ensure that business operations are not impacted. IT uses the Incident Management process to achieve this objective. An incident is an unplanned interruption to an IT service, or a reduction in its quality, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to full capacity, ensuring no business impact.
In most organizations, incidents are created by various business and IT users, by end users or vendors (if they have access to the ticketing systems), and by integrated monitoring systems and tools. Assigning incidents to the appropriate person or unit in the support team is critically important: it improves user satisfaction while ensuring better allocation of support resources.
Manual assignment of incidents is time consuming and labour intensive. Human errors lead to mistakes, and misaddressed tickets waste support resources. Manual assignment also increases response and resolution times, which results in deteriorating user satisfaction and poor customer service.
In the support process, incoming incidents are analyzed and assessed by the organization's support teams. In many organizations, better allocation and more effective use of valuable support resources translates directly into substantial cost savings.
Currently, incidents are created by various stakeholders (business users, IT users, and monitoring tools) within the IT Service Management tool and are assigned to Service Desk teams (L1/L2). These teams review the incidents for correct categorization and priority, then carry out an initial diagnosis to see whether they can resolve them. Around ~54% of incidents are resolved by the L1/L2 teams. If L1/L2 cannot resolve an incident, they escalate it to the functional teams from Applications and Infrastructure (L3). Some incidents are assigned directly to L3 teams, either by monitoring tools or by the callers/requestors themselves. L3 teams carry out a detailed diagnosis and resolve the incidents; around ~56% of incidents are resolved by the functional/L3 teams. If vendor support is needed, L3 teams engage the vendor to work toward incident closure.
L1/L2 teams need to spend time reviewing Standard Operating Procedures (SOPs) before assigning incidents to the functional teams (a minimum of ~25-30% of incidents require an SOP review before assignment), with about 15 minutes spent per incident. A minimum of ~1 FTE of effort is needed just for assigning incidents to L3 teams. During assignment by L1/L2 teams, there have been multiple instances of incidents going to the wrong functional group: around ~25% of incidents are wrongly assigned. Functional teams then need additional effort to re-assign them to the right groups, and in the meantime some incidents sit in the queue unaddressed, resulting in poor customer service.
Powerful AI techniques that classify incidents to the right functional groups can help organizations reduce resolution times and let support staff focus on more productive tasks.
Milestone 1: Pre-Processing, Data Visualisation and EDA
Milestone 2: Model Building
Milestone 3: Test the Model, Fine-tuning and Repeat
Section to import all necessary packages. Install the libraries that are not included in the Anaconda distribution by default, using either the PyPI channel or conda-forge.
!pip install ftfy wordcloud goslate spacy plotly cufflinks
conda install -c conda-forge ftfy wordcloud goslate spacy plotly cufflinks
# Utilities
from time import time
from PIL import Image
from zipfile import ZipFile
import os, sys, itertools, re
import warnings, pickle, string
from ftfy import fix_encoding, fix_text, badness
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Translation APIs
from goslate import Goslate # Provided by Google
# Numerical calculation
import numpy as np
# Data Handling
import pandas as pd
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
# Sequential Modeling
import keras.backend as K
from keras.datasets import imdb
from keras.models import Sequential, Model
from keras.layers.merge import Concatenate
from keras.layers import Input, Dropout, Flatten, Dense, Embedding, LSTM, GRU
from keras.layers import BatchNormalization, TimeDistributed, Conv1D, MaxPooling1D
from keras.constraints import max_norm, unit_norm
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import EarlyStopping, ModelCheckpoint
# Traditional Modeling
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Tools & Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, auc
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
# NLP toolkits
import spacy
import nltk
from nltk import tokenize
# Configure for any default setting of any library
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
init_notebook_mode(connected=True)
cf.go_offline()
%matplotlib inline
Mount the drive and set the project path to the current working directory when running in Google Colab. No changes are required when running on a local PC.
# Block which runs on both Google Colab and a local PC without any modification
if 'google.colab' in sys.modules:
    project_path = "/content/drive/My Drive/Colab Notebooks/DLCP/Capstone-NLP/"
    # Google Colab lib
    from google.colab import drive
    # Mount the drive
    drive.mount('/content/drive/', force_remount=True)
    sys.path.append(project_path)
    %cd $project_path
# Let's look at the sys path
print('Current working directory', os.getcwd())
# Load the dataset into a Pandas dataframe called ticket and check the head of the dataset
ticket = pd.read_excel('Input Data Synthetic (created but not used in our project).xlsx')
ticket.head()
# Check the tail of the dataset
ticket.tail()
Comments
The dataset is divided into two parts, namely, feature matrix and the response vector.
# Get the shape and size of the dataset
print('No of rows:\033[1m', ticket.shape[0], '\033[0m')
print('No of cols:\033[1m', ticket.shape[1], '\033[0m')
# Get more info on it
# 1. Name of the columns
# 2. Find the data types of each columns
# 3. Look for any null/missing values
ticket.info()
# Describe the dataset with various summary and statistics
ticket.describe()
# Check the Short description of tickets having Description as only 'the'
ticket[ticket.Description == 'the'].head()
# Find out the null value counts in each column
ticket.isnull().sum()
Observations
# Let's look at the rows with null values
ticket[pd.isnull(ticket).any(axis=1)]
# NULL replacement
ticket.fillna(str(), inplace=True)
ticket[pd.isnull(ticket).any(axis=1)]
# verify the replacement
ticket.isnull().sum()
Comments:
Mojibake is garbled text that results from decoding text with an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones, often from a different writing system.
The display may include the generic replacement character ("�") where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols, as viewed in one encoding, when the same binary code constitutes a single symbol in the other encoding. This happens either because of fixed-length encodings of different widths (as in Asian 16-bit encodings vs. European 8-bit encodings) or because of variable-length encodings (notably UTF-8 and UTF-16). A few such mojibake symbols are ¶, ç, å, €, æ, œ, º, ‡, ¼, ¥, etc.
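As a quick illustration of how mojibake arises, the sketch below (standard library only, with a made-up German word as input) encodes a string as UTF-8 and then decodes the bytes with the wrong codec; ftfy's job is to reverse exactly this kind of damage.

```python
# Mojibake in miniature: UTF-8 bytes decoded with the wrong (Latin-1) codec
original = 'schön'
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)  # schÃ¶n  (note the ¶, one of the symbols listed above)

# Reversing the mistaken decode recovers the original text
recovered = garbled.encode('latin-1').decode('utf-8')
print(recovered)  # schön
```

The two-character sequences such as "Ã¶" appear because each multi-byte UTF-8 sequence is reinterpreted as several single-byte Latin-1 characters.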
As we're dealing with natural language and the source of the data is unknown to us, let's run an encoding check to figure out whether the dataset is mojibake-impacted.
The ftfy ("fixes text for you") library is well suited to detecting and fixing such mojibake. It fixes Unicode that's broken in various ways; the goal of ftfy is to take in bad Unicode and output good Unicode.
Installation:
using pypi: !pip install ftfy
using conda: conda install -c conda-forge ftfy
# Write a function to apply to the dataset to detect mojibake
# (returns True when the text looks clean, False when it is likely mojibake-impacted)
def is_text_clean(text):
    if not badness.sequence_weirdness(text):
        # nothing weird, should be okay
        return True
    try:
        text.encode('sloppy-windows-1252')
    except UnicodeEncodeError:
        # Not CP-1252 encodable, probably fine
        return True
    else:
        # Encodable as CP-1252, mojibake alert level high
        return False
# Check the dataset for mojibake impact: flag rows where any text column fails the check
ticket[~ticket.iloc[:, :-1].applymap(is_text_clean).all(1)]
# Take an example of row# 8471 Short Desc and fix it
print('Garbled text: \033[1m%s\033[0m\nFixed text: \033[1m%s\033[0m' % (ticket['Short description'][8471],
                                                                        fix_text(ticket['Short description'][8471])))
# List all mojibakes defined in ftfy library
print('\nMojibake Symbol RegEx:\n', badness.MOJIBAKE_SYMBOL_RE.pattern)
# Sanitize the dataset from Mojibakes
ticket['Short description'] = ticket['Short description'].apply(fix_text)
ticket['Description'] = ticket['Description'].apply(fix_text)
# Visualize that row# 8471
ticket.iloc[8471,:]
# Serialize the mojibake treated dataset
ticket.to_csv('mojibake_treated.csv', index=False, encoding='utf_8_sig')
with open('mojibake_treated.pkl', 'wb') as handle:
    pickle.dump(ticket, handle, protocol=pickle.HIGHEST_PROTOCOL)
Comments:
badness.sequence_weirdness() estimates how often a text contains unexpected characters or sequences of characters. This metric is used to decide whether text should be re-decoded or left as is.
Goslate is an open-source Python library that implements the Google Translate API. It uses the Google Translate Ajax API to make calls to methods such as detect and translate. It is chosen over the Googletrans library because Goslate is designed to work around the blocking mechanism that prevents simple crawler programs from accessing the Ajax API. With multiple service URLs, Goslate can translate the entire dataset in very few iterations without getting the user's IP address blocked.
Installation:
using pypi: !pip install goslate
using conda: conda install -c conda-forge goslate
Service URLs used:
translate.google.com, translate.google.com.au, translate.google.com.ar, translate.google.co.kr, translate.google.co.in, translate.google.co.jp, translate.google.at, translate.google.de, translate.google.ru, translate.google.ch, translate.google.fr, translate.google.es, translate.google.ae
# Define and construct the service urls
svc_domains = ['.com','.com.au','.com.ar','.co.kr','.co.in','.co.jp','.at','.de','.ru','.ch','.fr','.es','.ae']
svc_urls = ['http://translate.google' + domain for domain in svc_domains]
# # Take an example of row# 8471 Short Desc and fix it
# gs = Goslate(service_urls=svc_urls)
# trans_8471 = gs.translate(ticket['Short description'][8471], target_language='en', source_language='auto')
# print('Original text: \033[1m%s\033[0m\nFixed text: \033[1m%s\033[0m' % (ticket['Short description'][8471], trans_8471))
print('Original text: \033[1m%s\033[0m\nFixed text: \033[1m%s\033[0m' % ('电脑开机开不出来', 'Boot the computer does not really come out'))
# List of column data to consider for translation
trans_cols = ['Short description','Description']
# Add a new column to store the detected language
ticket.insert(loc=2, column='Language', value=np.nan, allow_duplicates=True)
for idx in range(ticket.shape[0]):
    # Instantiate the Goslate class in each iteration
    gs = Goslate(service_urls=svc_urls)
    lang = gs.detect(' '.join(ticket.loc[idx, trans_cols].tolist()))
    row_iter = gs.translate(ticket.loc[idx, trans_cols].tolist(),
                            target_language='en',
                            source_language='auto')
    ticket.loc[idx, trans_cols] = list(row_iter)
    # Store the detected language for this row only (not the whole column)
    ticket.loc[idx, 'Language'] = lang
ticket.head()
# Serialize the translated dataset
ticket.to_csv('translated_ticket.csv', index=False, encoding='utf_8_sig')
with open('translated_ticket.pkl', 'wb') as f:
    pickle.dump(ticket, f, pickle.HIGHEST_PROTOCOL)
# Load the translated pickle file in case the IP gets blocked
with open('translated_ticket.pkl', 'rb') as f:
    ticket = pickle.load(f)
Comments:
Text preprocessing is the process of transforming text from human language into a machine-readable format for further processing. Once the text is obtained, we start with text normalization, which here includes lowercasing, preserving email addresses, removing numbers and punctuation, and collapsing whitespace.
# Define regex patterns
EMAIL_PATTERN = r"([\w.+-]+@[a-z\d-]+\.[a-z\d.-]+)"
PUNCT_PATTERN = r"[,|@|\|?|\\|$&*|%|\r|\n|.:|\s+|/|//|\\|/|\||-|<|>|;|(|)|=|+|#|-|\"|[-\]]|{|}]"
# Negative lookbehind for EmailId replacement: don't match any number that follows the text "RetainedEmailId"
NUMER_PATTERN = r"(?<!RetainedEmailId)(\d+(?:\.\d+)?)"
# Define a function to cleanse the texts
def cleanseText(text):
    # Make the text unicase (lower)
    text = str(text).lower()
    # Remove email addresses
    # text = re.sub(EMAIL_PATTERN, '', text, flags=re.IGNORECASE)
    # Save email addresses and replace them with a custom keyword
    email_dict = extract_email(text)
    for key in email_dict.keys():
        text = text.replace(email_dict[key], key)
    # Remove all numbers (digits inside the retained email keywords are spared)
    text = re.sub(NUMER_PATTERN, '', text)
    # Replace all punctuation with blank space
    # text = re.sub(PUNCT_PATTERN, " ", text, flags=re.MULTILINE)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse all whitespace runs into a single space
    text = re.sub(r'\s+', ' ', text)
    text = text.replace('`', "'")
    # Put the email ids back into their original positions
    for key in email_dict.keys():
        text = text.replace(key, email_dict[key])
    return text.strip()

def extract_email(text):
    # Replace the email addresses with a custom keyword and
    # save them into a dictionary for later restoration
    unique_emailid = set(re.findall(EMAIL_PATTERN, text))
    email_replacement = dict()
    for idx, email in enumerate(unique_emailid):
        email_replacement[f'RetainedEmailId{idx}'] = email
    return email_replacement
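The negative lookbehind in NUMER_PATTERN is what keeps the digits of the RetainedEmailId placeholders intact while every other number is stripped. A minimal standalone check (the ticket sentence is made up for illustration):

```python
import re

# Same pattern as above: match numbers unless they directly follow "RetainedEmailId"
NUMER_PATTERN = r"(?<!RetainedEmailId)(\d+(?:\.\d+)?)"

sample = "ticket 4521 raised by RetainedEmailId0"
print(re.sub(NUMER_PATTERN, '', sample))
# "4521" is removed, while the placeholder's trailing digit survives
```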
# Take an example of row# 32 Description and fix it
print('\033[1mOriginal text:\033[0m')
print(ticket['Description'][32])
print('_'*100)
print('\033[1mCleaned text:\033[0m')
print(cleanseText(ticket['Description'][32]))
# Apply the cleaning function to entire dataset
ticket['Description'] = ticket['Description'].apply(cleanseText)
ticket['Short description'] = ticket['Short description'].apply(cleanseText)
# Verify the data
ticket.tail()
Comments:
Now, with nice and clean data in hand, let's proceed to lemmatization.
Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing.
In grammar, inflection is known as the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix or infix, or another internal modification such as a vowel change.
Stemming
Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.
Lemmatization
Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
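To make the contrast concrete, here is a deliberately naive suffix-chopping stemmer (a toy sketch, not a production algorithm): the stem it produces need not be a valid English word, whereas a lemmatizer would map the same words to dictionary forms such as "study" and "care".

```python
def naive_suffix_stem(word):
    # Crude stemming: chop a common suffix; the result need not be a real word
    for suffix, repl in (("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print(naive_suffix_stem("studies"))  # studi  (a lemmatizer would give "study")
print(naive_suffix_stem("caring"))   # car    (a lemmatizer would give "care")
```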
The spaCy library is, along with NLTK, one of the most popular NLP libraries; its design philosophy is to ship a single, well-chosen algorithm for each task rather than many alternatives. Once it is downloaded and installed, the next step is to download a language model, which is used to perform a variety of NLP tasks.
Installation:
using pypi: !pip install spacy
using conda: conda install -c conda-forge spacy
Language Model Download:
$ python -m spacy download en_core_web_md
# Initialize spacy 'en' medium model, keeping only tagger component needed for lemmatization
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
# Define a function to lemmatize the descriptions
def lemmatizer(sentence):
    # Parse the sentence using the loaded 'en' model object `nlp`
    doc = nlp(sentence)
    return " ".join([token.lemma_ for token in doc if token.lemma_ != '-PRON-'])
# Take an example of row# 43 Description and lemmatize it
print('\033[1mOriginal text:\033[0m')
print(ticket['Description'][43])
print('_'*100)
print('\033[1mLemmatized text:\033[0m')
print(lemmatizer(ticket['Description'][43]))
# Apply the Lemmatization to entire dataset
ticket['Description'] = ticket['Description'].apply(lemmatizer)
ticket['Short description'] = ticket['Short description'].apply(lemmatizer)
# Verify the data
ticket.tail()
# Serialize the preprocessed dataset
ticket.to_csv('preprocessed_ticket.csv', index=False, encoding='utf_8_sig')
with open('preprocessed_ticket.pkl', 'wb') as f:
    pickle.dump(ticket, f, pickle.HIGHEST_PROTOCOL)
# Create new features of length and word count for both of the description columns
ticket.insert(1, 'sd_len', ticket['Short description'].astype(str).apply(len))
ticket.insert(2, 'sd_word_count', ticket['Short description'].apply(lambda x: len(str(x).split())))
ticket.insert(4, 'desc_len', ticket['Description'].astype(str).apply(len))
ticket.insert(5, 'desc_word_count', ticket['Description'].apply(lambda x: len(str(x).split())))
ticket.head()
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a dataset, uncover its underlying structure, extract important variables, and detect outliers and anomalies.
Visually representing the content of a text document is one of the most important tasks in the field of text mining. It helps not only to explore the content of documents from different aspects and at different levels of detail, but also to summarize a single document, show its words and topics, detect events, and create storylines.
We'll be using the plotly library to generate the graphs and visualizations, and cufflinks to link plotly to the pandas DataFrame and add the iplot method.
Installation:
using pypi: !pip install plotly cufflinks
using conda: conda install -c conda-forge plotly cufflinks
# Check the version of Plotly and Cufflinks packages
print('Plotly:', py.__version__)
print('Cufflinks:', cf.__version__)
Single-variable or univariate visualization is the simplest type of visualization, consisting of observations on only a single characteristic or attribute. Univariate visualizations include histograms, bar plots, and line charts.
Plots how the assignment groups are distributed across the dataset. The bar chart, histogram, and pie chart show the frequency of tickets assigned to each group, i.e. the ticket count per group.
# Assignment group distribution
print('\033[1mTotal assignment groups:\033[0m', ticket['Assignment group'].nunique())
# Histogram
ticket['Assignment group'].iplot(
kind='hist',
xTitle='Assignment Group',
yTitle='count',
title='Assignment Group Distribution- Histogram (Fig-1)')
# Pie chart
assgn_grp = pd.DataFrame(ticket.groupby('Assignment group').size(),columns = ['Count']).reset_index()
assgn_grp.iplot(
kind='pie',
labels='Assignment group',
values='Count',
title='Assignment Group Distribution- Pie Chart (Fig-2)',
hoverinfo="label+percent+name", hole=0.25)
# Bar plot
ticket['Assignment group'].iplot(
kind='bar',
yTitle='Assignment Group',
xTitle='Record #',
colorscale='-plotly',
title='Assignment Group Distribution- Bar Chart (Fig-3)')
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution. In practice, the approximation is commonly considered adequate for sample sizes over 30: as you take more samples, especially large ones, the graph of the sample means looks more and more like a normal distribution.
Hence, let's identify the assignment groups that don't have at least 30 tickets assigned to them and categorize them as rare groups.
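A quick simulation (standard library only) illustrates the theorem: the means of uniform samples of size 50 cluster tightly and symmetrically around the population mean 0.5, with spread close to the theoretical sigma/sqrt(n) = (1/sqrt(12))/sqrt(50) ≈ 0.041.

```python
import random
import statistics

random.seed(0)  # reproducible demo
# Draw 1000 samples of size 50 from Uniform(0, 1) and record each sample mean
sample_means = [statistics.mean(random.random() for _ in range(50))
                for _ in range(1000)]

print(statistics.mean(sample_means))   # close to 0.5
print(statistics.stdev(sample_means))  # close to 0.041
```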
# Find out the Assignment Groups with less than equal to 30 tickets assigned
rare_ticket = ticket.groupby(['Assignment group']).filter(lambda x: len(x) <= 30)
print('\033[1m#Groups with less than equal to 30 tickets assigned:\033[0m', rare_ticket['Assignment group'].nunique())
rare_ticket['Assignment group'].iplot(
kind='hist',
xTitle='Assignment Group',
yTitle='count',
colorscale='-orrd',
title='#Records by rare Assignment Groups- Histogram (Fig-4)')
# Distribution of Assignment groups excluding GRP_0 & rare groups (groups with less than equal 30 tickets assigned)
excluded_grp = ['GRP_0']
excluded_grp.extend(rare_ticket['Assignment group'].unique())
filtered_tkt = ticket[~ticket['Assignment group'].isin(excluded_grp)]
# Pie chart
filtered_assgn_grp = pd.DataFrame(filtered_tkt.groupby('Assignment group').size(),columns = ['Count']).reset_index()
filtered_assgn_grp.iplot(
kind='pie',
labels='Assignment group',
values='Count',
title='#Records by Assignment groups(excluding GRP_0 and rare groups)- Pie Chart (Fig-5)',
pull=np.linspace(0,0.3,filtered_assgn_grp['Assignment group'].nunique()))
# Histogram
filtered_tkt['Assignment group'].iplot(
kind='histogram',
xTitle='Assignment Group',
yTitle='count',
colorscale='-gnbu',
title='#Records by Assignment groups(excluding GRP_0 and rare groups)- Histogram (Fig-6)')
Comments:
Plots how the callers are associated with tickets and which assignment groups they most frequently raise tickets for.
# Find out top 10 callers in terms of frequency of raising tickets in the entire dataset
print('\033[1mTotal caller count:\033[0m', ticket['Caller'].nunique())
df = pd.DataFrame(ticket.groupby(['Caller']).size().nlargest(10), columns=['Count']).reset_index()
df.iplot(kind='pie',
labels='Caller',
values='Count',
title='Top 10 caller- Pie Chart (Fig-7)',
colorscale='-spectral',
pull=[0,0,0,0,0.05,0.1,0.15,0.2,0.25,0.3])
# Top 5 callers in each assignment group
top_n = 5
s = ticket['Caller'].groupby(ticket['Assignment group']).value_counts()
caller_grp = pd.DataFrame(s.groupby(level=0).nlargest(top_n).reset_index(level=0, drop=True))
caller_grp.head(15)
# Visualize Top 5 callers in each of top 10 assignment groups
top_n = 10
top_grps = assgn_grp.nlargest(top_n, 'Count')['Assignment group'].tolist()
fig_cols = 5
fig_rows = int(np.ceil(top_n/fig_cols))
fig, axes = plt.subplots(fig_rows, fig_cols, figsize=(13,9.5))
fig.suptitle('Top 5 callers in each of top 10 assignment groups- Pie Chart (Fig-8)', y=1, va= 'bottom', size='20')
for row in range(fig_rows):
    for col in range(fig_cols):
        grp_n = fig_cols * row + col
        if grp_n < top_n:
            xs = caller_grp.xs(top_grps[grp_n])
            _ = axes[row, col].pie(xs, autopct='%1.1f%%', explode=[0.05]*5)
            axes[row, col].legend(labels=xs.index, loc="best")
            axes[row, col].axis('equal')
            axes[row, col].set_title(top_grps[grp_n])
plt.tight_layout()
# Check whether any caller raises tickets for multiple groups
# (look for caller names duplicated across the group level of the index)
callers = caller_grp.index.get_level_values(1)
dup_mask = callers.duplicated(keep=False)
mul_caller = caller_grp[dup_mask]
uni_mul_caller = callers[dup_mask].unique().tolist()
print(f'\033[1mFollowing {len(uni_mul_caller)} callers happen to raise tickets for multiple groups:\033[0m\n')
print(uni_mul_caller)
mul_caller
Comments:
Plots the variation in length and word count of the Short description attribute.
# Short Desc text length
ticket['sd_len'].iplot(
kind='scatter',
xTitle='text length',
yTitle='count',
title='Short Desc. Text Length Distribution (Fig-9)')
# Short desc word count
ticket['sd_word_count'].iplot(
kind='hist',
bins=100,
xTitle='word count',
linecolor='black',
yTitle='count',
colorscale='pastel1',
title='Short desc. Word Count Distribution (Fig-10)')
Plots the variation in length and word count of the Description attribute.
# Description text length
ticket['desc_len'].iplot(
kind='bar',
xTitle='text length',
yTitle='count',
colorscale='-ylgn',
title='Description Text Length Distribution (Fig-11)')
# Description word count
ticket['desc_word_count'].iplot(
kind='bar',
xTitle='word count',
linecolor='black',
yTitle='count',
colorscale='-bupu',
title='Description Word Count Distribution (Fig-12)')
Plots the volume of tickets received in different languages.
# Find out the volume of tickets received in different languages other than English
print('\033[1mTotal languages detected:\033[0m', ticket.Language.nunique())
df = pd.DataFrame(ticket.groupby(['Language']).size(), columns=['Count']).reset_index()
# Pie Chart
df[df.Language != 'English'].iplot(kind='pie',
labels='Language',
values='Count',
title='Detected Language distribution- Pie Chart (Fig-13)',
colorscale='plotly',
pull=np.linspace(0,0.2,df.Language.nunique() - 1))
Comments:
An n-gram is a contiguous sequence of N items from a given sample of text or speech, in the fields of computational linguistics and probability. The items can be phonemes, syllables, letters, words, or base pairs, depending on the application. N-grams describe the number of words used as observation points: a unigram is a single word, a bigram a two-word phrase, and a trigram a three-word phrase.
We'll be using scikit-learn's CountVectorizer to derive n-grams and compare them before and after removing stop words. Stop words are a set of commonly used words in a language. We'll use the English stop-word corpus, extended with some business-specific common words that act as stop words in our case.
But before that, let's merge the Short description and Description column texts and write a generic method to derive the n-grams.
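Before wiring this into CountVectorizer, the idea itself fits in a few lines (a toy sketch; the ticket phrase is made up):

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "unable to reset password".split()
print(ngrams(tokens, 1))  # ['unable', 'to', 'reset', 'password']
print(ngrams(tokens, 2))  # ['unable to', 'to reset', 'reset password']
print(ngrams(tokens, 3))  # ['unable to reset', 'to reset password']
```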
# Merge the Short description and Description column texts to create a new column
ticket.insert(loc=8,
column='Summary',
allow_duplicates=True,
value=list(ticket['Short description'].str.strip() + ' ' + ticket['Description'].str.strip()))
# Extend the English stop words
STOP_WORDS = STOPWORDS.union({'yes','na','hi',
'receive','hello',
'regards','thanks',
'from','greeting',
'forward','reply',
'will','please',
'see','help','able'})
# Generic function to derive the top N n-grams from the corpus
def get_top_n_ngrams(corpus, top_n=None, ngram_range=(1,1), stopwords=None):
    vec = CountVectorizer(ngram_range=ngram_range,
                          stop_words=stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:top_n]
# Top 50 Unigrams before removing stop words
top_n = 50
ngram_range = (1,1)
uni_grams = get_top_n_ngrams(ticket.Summary, top_n, ngram_range)
df = pd.DataFrame(uni_grams, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='piyg',
title=f'Top {top_n} Unigrams in Summary')
# Top 50 Unigrams after removing stop words
uni_grams_sw = get_top_n_ngrams(ticket.Summary, top_n, ngram_range, stopwords=STOP_WORDS)
df = pd.DataFrame(uni_grams_sw, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='-piyg',
title=f'Top {top_n} Unigrams in Summary without stop words')
# Top 50 Bigrams before removing stop words
top_n = 50
ngram_range = (2,2)
bi_grams = get_top_n_ngrams(ticket.Summary, top_n, ngram_range)
df = pd.DataFrame(bi_grams, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='piyg',
title=f'Top {top_n} Bigrams in Summary')
# Top 50 Bigrams after removing stop words
bi_grams_sw = get_top_n_ngrams(ticket.Summary, top_n, ngram_range, stopwords=STOP_WORDS)
df = pd.DataFrame(bi_grams_sw, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='-piyg',
title=f'Top {top_n} Bigrams in Summary without stop words')
# Top 50 Trigrams before removing stop words
top_n = 50
ngram_range = (3,3)
tri_grams = get_top_n_ngrams(ticket.Summary, top_n, ngram_range)
df = pd.DataFrame(tri_grams, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='piyg',
title=f'Top {top_n} Trigrams in Summary')
# Top 50 Trigrams after removing stop words
tri_grams_sw = get_top_n_ngrams(ticket.Summary, top_n, ngram_range, stopwords=STOP_WORDS)
df = pd.DataFrame(tri_grams_sw, columns = ['Summary' , 'count'])
df.groupby('Summary').sum()['count'].sort_values(ascending=False).iplot(
kind='bar',
yTitle='Count',
linecolor='black',
colorscale='-piyg',
title=f'Top {top_n} Trigrams in Summary without stop words')
A word cloud is a collection, or cluster, of words depicted in different sizes: the bigger and bolder a word appears, the more often it is mentioned within a given text and the more important it is.
Also known as tag clouds or text clouds, these are ideal ways to pull out the most pertinent parts of textual data, and they often help business users compare and contrast two pieces of text to find wording similarities between them.
Let's write a generic method to generate Word Clouds for both Short and Long Description columns.
def generate_word_cloud(corpus):
    # mask = np.array(Image.open('cloud2.png'))
    # Instantiate the wordcloud object
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          stopwords=STOP_WORDS,
                          # mask=mask,
                          min_font_size=10).generate(corpus)
    # Plot the WordCloud image
    plt.figure(figsize=(12, 12), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
# Word cloud for all tickets assigned to GRP_0
generate_word_cloud(' '.join(ticket[ticket['Assignment group'] == 'GRP_0'].Summary.str.strip()))
# Generate word cloud for ticket Short description
generate_word_cloud(' '.join(ticket['Short description'].str.strip()))
# Generate word cloud for ticket Description
generate_word_cloud(' '.join(ticket.Description.str.strip()))
# Generate word cloud for ticket Summary
generate_word_cloud(' '.join(ticket.Summary.str.strip()))
# Serialize the dataset after EDA
with open('model_ready.pkl', 'wb') as f:
    pickle.dump(ticket, f, pickle.HIGHEST_PROTOCOL)
Comments:
Let's proceed to try the different model architectures mentioned below on the classification problem and validate which one performs best.
Let's create another column of categorical datatype from the Assignment group, and write some generic utility methods, including one to plot evaluation metrics.
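As a small illustration of what `.astype('category').cat.codes` produces (the group names here are made up): pandas sorts the category labels lexicographically and assigns each one an integer code, so for example GRP_12 sorts before GRP_3.

```python
import pandas as pd

# Hypothetical assignment groups; sorted categories are GRP_0, GRP_12, GRP_3
grp = pd.Series(['GRP_3', 'GRP_0', 'GRP_12', 'GRP_0'])
codes = grp.astype('category').cat.codes
print(codes.tolist())  # [2, 0, 1, 0]
```

Note the string-sort quirk: if a numeric ordering of group names matters, the codes won't reflect it, but for a classification target any stable one-to-one mapping is fine.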
# Create a target categorical column
ticket['target'] = ticket['Assignment group'].astype('category').cat.codes
ticket.info()
# A class that logs the time
class Timer():
    '''
    A generic class to log elapsed time
    '''
    def __init__(self):
        self.start_ts = None

    def start(self):
        self.start_ts = time()

    def stop(self):
        return 'Time taken: %.2fs' % (time() - self.start_ts)

timer = Timer()
# A method that plots the precision-recall curve
def plot_prec_recall_vs_thresh(precisions, recalls, thresholds):
    plt.figure(figsize=(10, 5))
    plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
    plt.plot(thresholds, recalls[:-1], 'g--', label='recall')
    plt.xlabel('Threshold')
    plt.legend()
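The helper above expects the arrays produced by scikit-learn's precision_recall_curve for a binary problem; a minimal sketch with invented labels and scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Invented ground truth and classifier scores for illustration only
y_true = np.array([0, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

# precisions/recalls have one more element than thresholds by construction
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
# plot_prec_recall_vs_thresh(precisions, recalls, thresholds)
print(precisions.shape, recalls.shape, thresholds.shape)
```

For our multi-class problem the same curve would be drawn one class at a time (one-vs-rest).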
# A method to train and test a model
def run_classification(estimator, X_train, X_test, y_train, y_test,
                       arch_name=None, pipelineRequired=True, isDeepModel=False):
    timer.start()
    # Train the model
    clf = estimator
    if pipelineRequired:
        clf = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', estimator),
                        ])
    if isDeepModel:
        clf.fit(X_train, y_train, validation_data=(X_test, y_test),
                epochs=10, batch_size=128, verbose=1, callbacks=call_backs(arch_name))
        # Predict from the classifier
        y_pred = clf.predict(X_test)
        y_pred = np.argmax(y_pred, axis=1)
        y_train_pred = clf.predict(X_train)
        y_train_pred = np.argmax(y_train_pred, axis=1)
    else:
        clf.fit(X_train, y_train)
        # Predict from the classifier
        y_pred = clf.predict(X_test)
        y_train_pred = clf.predict(X_train)
    print('Estimator:', clf)
    print('=' * 80)
    print('Training accuracy: %.2f%%' % (accuracy_score(y_train, y_train_pred) * 100))
    print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))
    print('=' * 80)
    print('Confusion matrix:\n %s' % (confusion_matrix(y_test, y_pred)))
    print('=' * 80)
    print('Classification report:\n %s' % (classification_report(y_test, y_pred)))
    print(timer.stop(), 'to run the model')
# Create training and test datasets with an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(ticket.Summary,
                                                    ticket.target,
                                                    test_size=0.20,
                                                    random_state=42)
print('\033[1mShape of the training set:\033[0m', X_train.shape, y_train.shape)
print('\033[1mShape of the test set:\033[0m', X_test.shape, y_test.shape)
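Since the assignment groups are highly imbalanced, a stratified split keeps rare groups proportionally represented in both sets; a sketch on synthetic labels (only the stratify argument changes, and it requires at least two samples per class):

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in: three classes with uneven counts (70 / 20 / 10)
X = list(range(100))
y = [0] * 70 + [1] * 20 + [2] * 10

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=42, stratify=y)
# The 20-sample test split preserves the 70:20:10 class proportions
print(sorted(set(y_te)))  # [0, 1, 2]
```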
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
Advantages: naive Bayes is fast to train, needs relatively little data, and copes well with high-dimensional sparse features such as word counts.
Disadvantages: the conditional-independence assumption rarely holds for real text, and the probability estimates it produces tend to be poorly calibrated.
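Under the independence assumption the posterior factorises as P(c|x) ∝ P(c) · ∏ P(x_i|c); a tiny sketch with invented word counts:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Two classes, three vocabulary words; rows are invented word-count vectors.
# Class 0 documents use word 0 heavily, class 1 documents use words 1 and 2.
X = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 2]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X, y)
# A new document containing only word 0 gets a high P(x|c=0)
print(clf.predict(np.array([[1, 0, 0]])))  # predicts class 0
```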
run_classification(MultinomialNB(), X_train, X_test, y_train, y_test)
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.
The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.
A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data.
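The 1/d weighting described above is available directly in scikit-learn via weights='distance'; a toy sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated 1-D clusters (invented data)
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]

# weights='distance' gives each neighbor a 1/d vote, as described above
knn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)
print(knn.predict([[0.15], [1.05]]))  # [0 1]
```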
run_classification(KNeighborsClassifier(), X_train, X_test, y_train, y_test)
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.
The advantages of support vector machines, per the scikit-learn documentation, include: they are effective in high-dimensional spaces, remain effective even when the number of dimensions exceeds the number of samples, and are memory efficient because the decision function uses only a subset of the training points (the support vectors).
The disadvantages of support vector machines include: they do not directly provide probability estimates, and when the number of features greatly exceeds the number of samples, avoiding overfitting requires careful choice of kernel and regularization.
# SVM with Linear kernel
run_classification(LinearSVC(), X_train, X_test, y_train, y_test)
# SVM with RBF kernel
run_classification(SVC(kernel='rbf'), X_train, X_test, y_train, y_test)
Decision tree classifiers are a well-known classification technique in various pattern recognition problems, for example image classification and character recognition (Safavian & Landgrebe, 1991). They perform well, particularly on complex classification problems, due to their high adaptability and computational efficiency, and they outperform many typical supervised classification methods (Friedl & Brodley, 1997).
In particular, decision tree classifiers require no distributional assumptions about the input data. This gives them high adaptability across datasets, whether numeric or categorical, even with missing data, and they are essentially nonparametric. Decision trees are also well suited to nonlinear relations between features and classes. Finally, classification via a tree-like structure is naturally intuitive and interpretable.
run_classification(DecisionTreeClassifier(), X_train, X_test, y_train, y_test)
Random forests (random decision forests) are an ensemble learning method that also works well for text classification. The technique was introduced by Tin Kam Ho in 1995, who trained multiple decision trees in parallel on random subspaces of the features, and was later developed by Leo Breiman, who analyzed the convergence of random forests in terms of a margin measure.
run_classification(RandomForestClassifier(n_estimators=100), X_train, X_test, y_train, y_test)
A Neural Network, unlike statistical ML algorithms, is a series of algorithms that endeavors to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. Neural networks can adapt to changing input; so the network generates the best possible result without needing to redesign the output criteria.
Define a function for checkpoints
# Path where you want to save the weights, model and checkpoints
model_path = "Weights/"
%mkdir Weights
# Define model callbacks
def call_backs(name):
    early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=100)
    model_checkpoint = ModelCheckpoint(model_path + name + '_epoch{epoch:02d}_loss{val_loss:.4f}.h5',
                                       monitor='val_loss',
                                       verbose=1,
                                       save_best_only=True,
                                       save_weights_only=False,
                                       mode='min',
                                       period=1)
    return [model_checkpoint, early_stopping]
Deep neural network architectures learn through multiple layers, where each hidden layer receives connections only from the previous layer and provides connections only to the next. The input layer connects the feature space (as discussed in the feature extraction section) to the first hidden layer; for a deep neural network (DNN) the input can be tf-idf vectors, word embeddings, etc. The output layer has as many neurons as there are classes for multi-class classification, and a single neuron for binary classification. In some variants, many DNNs are trained to serve different purposes, with the number of nodes per layer and the number of layers assigned randomly. Our implementation of the DNN is a discriminatively trained model that uses the standard back-propagation algorithm with sigmoid or ReLU activation functions; for multi-class classification the output layer uses softmax.
# Function to build a deep NN
def Build_Model_DNN_Text(shape, nClasses, dropout=0.3):
    """
    Build_Model_DNN_Text(shape, nClasses, dropout)
    Build a deep neural network model for text classification
    shape is the input feature space
    nClasses is the number of classes
    """
    model = Sequential()
    node = 512     # number of nodes
    nLayers = 4    # number of hidden layers
    model.add(Dense(node, input_dim=shape, activation='relu'))
    model.add(Dropout(dropout))
    for i in range(0, nLayers):
        model.add(Dense(node, input_dim=node, activation='relu'))
        model.add(Dropout(dropout))
        model.add(BatchNormalization())
    model.add(Dense(nClasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    print(model.summary())
    return model
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(ticket.Summary)
X_train_tfidf = Tfidf_vect.transform(X_train)
X_test_tfidf = Tfidf_vect.transform(X_test)
# Instantiate the network
model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 75)
model_DNN.fit(X_train_tfidf, y_train,
              validation_data=(X_test_tfidf, y_test),
              callbacks=call_backs("NN"),
              epochs=10,
              batch_size=128,
              verbose=2)
predicted = model_DNN.predict(X_test_tfidf)
# Check if already extracted, else open the zipped file read-only and extract
if not os.path.isfile('glove.6B/glove.6B.100d.txt'):
    glove_embeddings = 'glove.6B.zip'
    with ZipFile(glove_embeddings, 'r') as archive:
        archive.extractall('glove.6B')
# List the files under extracted folder
os.listdir('glove.6B')
Another deep learning architecture employed for hierarchical document classification is the Convolutional Neural Network (CNN). Although originally built for image processing, with an architecture similar to the visual cortex, CNNs have also been used effectively for text classification. In a basic CNN for image processing, an image tensor is convolved with a set of kernels of size d by d. These convolution layers, called feature maps, can be stacked to provide multiple filters on the input. To reduce computational complexity, CNNs use pooling, which reduces the size of the output from one layer to the next in the network. Different pooling techniques reduce outputs while preserving important features.
The most common pooling method is max pooling, where the maximum element is selected from the pooling window. To feed the pooled output from stacked feature maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected dense layers. In general, during the back-propagation step of a convolutional neural network, not only the weights but also the feature detector filters are adjusted. A potential problem of CNNs applied to text is the number of 'channels', i.e. the size of the feature space: for text this can be very large (e.g. 50K words), whereas for images it is small (e.g. only the 3 RGB channels), so the dimensionality of a CNN for text is very high.
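Max pooling itself is easy to illustrate in NumPy (window size and input values are arbitrary):

```python
import numpy as np

def max_pool_1d(x, pool_size):
    """Non-overlapping 1-D max pooling; any trailing remainder is dropped."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

x = np.array([1, 5, 2, 8, 3, 3, 9, 0])
print(max_pool_1d(x, 2))  # [5 8 3 9]
```

This is the same operation Keras's MaxPooling1D applies per feature map, halving the sequence length while keeping the strongest activation in each window.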
gloveFileName = 'glove.6B/glove.6B.200d.txt'
MAX_SEQUENCE_LENGTH = 500
EMBEDDING_DIM=200
MAX_NB_WORDS=75000
# Function to tokenize the text and load the GloVe embeddings
def loadData_Tokenizer(X_train, X_test, filename):
    np.random.seed(7)
    text = np.concatenate((X_train, X_test), axis=0)
    text = np.array(text)
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
    tokenizer.fit_on_texts(text)
    sequences = tokenizer.texts_to_sequences(text)
    word_index = tokenizer.word_index
    text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
    print('Found %s unique tokens.' % len(word_index))
    indices = np.arange(text.shape[0])
    # np.random.shuffle(indices)
    text = text[indices]
    print(text.shape)
    X_train = text[0:len(X_train), ]
    X_test = text[len(X_train):, ]
    embeddings_index = {}
    with open(filename, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            try:
                coefs = np.asarray(values[1:], dtype='float32')
            except ValueError:
                continue  # skip malformed lines instead of reusing stale coefs
            embeddings_index[word] = coefs
    print('Total %s word vectors.' % len(embeddings_index))
    return (X_train, X_test, word_index, embeddings_index)
# Build the embedding matrix from the GloVe vectors
def buildEmbed_matrices(word_index, embedding_dim):
    # Relies on the module-level embeddings_index returned by loadData_Tokenizer
    embedding_matrix = np.random.random((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index keep their random initialization
            if len(embedding_matrix[i]) != len(embedding_vector):
                print('Could not broadcast input array from shape', len(embedding_matrix[i]),
                      'into shape', len(embedding_vector),
                      '- please make sure EMBEDDING_DIM matches the GloVe file used')
                exit(1)
            embedding_matrix[i] = embedding_vector
    return embedding_matrix
# Generate Glove embedded datasets
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train,X_test,gloveFileName)
embedding_matrix = buildEmbed_matrices(word_index,EMBEDDING_DIM)
def Build_Model_CNN_Text(word_index, embeddings_matrix, nclasses, dropout=0.5):
    """
    Build_Model_CNN_Text(word_index, embeddings_matrix, nclasses, dropout=0.5)
    word_index is the word index,
    embeddings_matrix is the embedding matrix,
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences,
    EMBEDDING_DIM is the dimension of the word embeddings
    """
    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embeddings_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=True)
    # Applying a more complex convolutional approach
    convs = []
    filter_sizes = []
    layer = 5
    print("Filter ", layer)
    for fl in range(0, layer):
        filter_sizes.append(fl + 2)
    node = 128
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)
    for fsz in filter_sizes:
        l_conv = Conv1D(node, kernel_size=fsz, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(5)(l_conv)
        # l_pool = Dropout(0.25)(l_pool)
        convs.append(l_pool)
    l_merge = Concatenate(axis=1)(convs)
    l_cov1 = Conv1D(node, 5, activation='relu')(l_merge)
    l_cov1 = Dropout(dropout)(l_cov1)
    l_batch1 = BatchNormalization()(l_cov1)
    l_pool1 = MaxPooling1D(5)(l_batch1)
    l_cov2 = Conv1D(node, 5, activation='relu')(l_pool1)
    l_cov2 = Dropout(dropout)(l_cov2)
    l_batch2 = BatchNormalization()(l_cov2)
    l_pool2 = MaxPooling1D(30)(l_batch2)
    l_flat = Flatten()(l_pool2)
    l_dense = Dense(1024, activation='relu')(l_flat)
    l_dense = Dropout(dropout)(l_dense)
    l_dense = Dense(512, activation='relu')(l_dense)
    l_dense = Dropout(dropout)(l_dense)
    preds = Dense(nclasses, activation='softmax')(l_dense)
    model = Model(sequence_input, preds)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    print(model.summary())
    return model
# Train the network and run classification
model_CNN = Build_Model_CNN_Text(word_index, embedding_matrix, 75)
run_classification(model_CNN, X_train_Glove, X_test_Glove, y_train, y_test,
                   pipelineRequired=False, isDeepModel=True, arch_name='CNN')
RNNs assign more weight to the earlier data points of a sequence, which makes them a powerful method for text, string and sequential data classification. In an RNN, the neural net considers information from previous nodes in a sophisticated way, which allows for better semantic analysis of the structures in the dataset.
Gated Recurrent Unit (GRU)
Gated Recurrent Unit (GRU) is a gating mechanism for RNNs introduced by J. Chung et al. and K. Cho et al. GRU is a simplified variant of the LSTM architecture, with the following differences: a GRU contains two gates, does not possess internal memory, and does not apply a second non-linearity (the tanh in the LSTM cell).
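The two gates can be written out explicitly; a single-step GRU cell sketch in NumPy (dimensions and random weights are purely illustrative, not the trained model):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_cand           # interpolate old and candidate state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wz, Wr, Wh = (rng.standard_normal((d_h, d_in)) for _ in range(3))
Uz, Ur, Uh = (rng.standard_normal((d_h, d_h)) for _ in range(3))
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), Wz, Uz, Wr, Ur, Wh, Uh)
print(h.shape)  # (3,)
```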
def Build_Model_RNN_Text(word_index, embeddings_matrix, nclasses, dropout=0.5):
    """
    Build_Model_RNN_Text(word_index, embeddings_matrix, nclasses, dropout=0.5)
    word_index is the word index,
    embeddings_matrix is the embedding matrix,
    nclasses is the number of classes,
    MAX_SEQUENCE_LENGTH is the maximum length of the text sequences
    """
    model = Sequential()
    hidden_layer = 3
    gru_node = 32
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embeddings_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    print(gru_node)
    for i in range(0, hidden_layer):
        model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
        model.add(BatchNormalization())
    model.add(GRU(gru_node, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
    model.add(BatchNormalization())
    model.add(Dense(256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(nclasses, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])
    print(model.summary())
    return model
# Train the network and run classification
model_RNN = Build_Model_RNN_Text(word_index, embedding_matrix, 75)
run_classification(model_RNN, X_train_Glove, X_test_Glove, y_train, y_test,
                   pipelineRequired=False, isDeepModel=True, arch_name='RNN')
Recurrent Convolutional Neural Networks (RCNN) are also used for text classification. The main idea of this technique is to capture contextual information with the recurrent structure and to construct the representation of text using a convolutional neural network. The architecture combines an RNN and a CNN to exploit the advantages of both techniques in one model.
def Build_Model_RCNN_Text(word_index, embeddings_matrix, nclasses):
    kernel_size = 2
    filters = 256
    pool_size = 2
    gru_node = 256
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embeddings_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(LSTM(gru_node, recurrent_dropout=0.2))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(nclasses))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='sgd',
                  metrics=['accuracy'])
    print(model.summary())
    return model
# Train the network and run classification
model_RCNN = Build_Model_RCNN_Text(word_index, embedding_matrix, 75)
run_classification(model_RCNN, X_train_Glove, X_test_Glove, y_train, y_test,
                   pipelineRequired=False, isDeepModel=True, arch_name='RCNN')
Long Short-Term Memory (LSTM) was introduced by S. Hochreiter and J. Schmidhuber and has since been developed by many researchers.
LSTM is a special type of RNN that preserves long-term dependencies more effectively than basic RNNs. It is particularly useful for overcoming the vanishing gradient problem, as it uses multiple gates to carefully regulate the amount of information allowed into each node state. The figure shows the basic cell of an LSTM model.
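The gate structure can be sketched as a single NumPy step (random weights, illustration only; the W, U, b parameter lists are hypothetical, not the trained layer's weights):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b each hold the (input, forget, output, candidate) params."""
    i = sigmoid(W[0] @ x + U[0] @ h + b[0])   # input gate
    f = sigmoid(W[1] @ x + U[1] @ h + b[1])   # forget gate
    o = sigmoid(W[2] @ x + U[2] @ h + b[2])   # output gate
    g = np.tanh(W[3] @ x + U[3] @ h + b[3])   # candidate cell state
    c_new = f * c + i * g                     # gated cell-state update
    return o * np.tanh(c_new), c_new

rng = np.random.default_rng(1)
d_in, d_h = 4, 3
W = [rng.standard_normal((d_h, d_in)) for _ in range(4)]
U = [rng.standard_normal((d_h, d_h)) for _ in range(4)]
b = [np.zeros(d_h) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
print(h.shape, c.shape)
```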
EMBEDDING_DIM = 100
gloveFileName = 'glove.6B/glove.6B.100d.txt'
from keras.models import Sequential, Model
from keras.layers import Input, Dense, LSTM, TimeDistributed, Activation
from keras.layers import Flatten, Permute, multiply, concatenate, Dropout
from keras.layers import Embedding, GRU, Bidirectional
def Build_Model_LSTM_Text(word_index, embeddings_matrix, nclasses):
    kernel_size = 2
    filters = 256
    pool_size = 2
    gru_node = 256
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                        EMBEDDING_DIM,
                        weights=[embeddings_matrix],
                        input_length=MAX_SEQUENCE_LENGTH,
                        trainable=True))
    model.add(Dropout(0.25))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Conv1D(filters, kernel_size, activation='relu'))
    model.add(MaxPooling1D(pool_size=pool_size))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2)))
    model.add(Bidirectional(LSTM(gru_node, recurrent_dropout=0.2)))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(nclasses))
    model.add(Activation('softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    print(model.summary())
    return model
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test, gloveFileName)
embedding_matrix = buildEmbed_matrices(word_index, EMBEDDING_DIM)
model_LSTM = Build_Model_LSTM_Text(word_index, embedding_matrix, 75)
run_classification(model_LSTM, X_train_Glove, X_test_Glove, y_train, y_test,
                   pipelineRequired=False, isDeepModel=True, arch_name='LSTM')
Of all the model architectures we've tried, the accuracy of each model is shown in the table below. The statistical models overfit to a higher degree; one obvious reason is that the dataset is highly imbalanced. The neural networks also need fine-tuning to increase accuracy. The following are some of the techniques we'll try in Milestone-2 as part of fine-tuning.